The goal of this notebook is to introduce you to a new publicly available, open-access dataset in BigQuery. This set of BigQuery tables was produced by the ISB-CGC project, based on the open-access TCGA data available at the TCGA Data Portal. You will need access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free trial or contact us and become part of the community evaluation phase of our Cancer Genomics Cloud pilot. (You can find more information about this NCI-funded program here.)
We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are links to some resources that you might find useful:
There are also many tutorials and samples available on GitHub (see, in particular, the datalab repo and the Google Genomics project).
In order to work with BigQuery, the first thing you need to do is import the gcp.bigquery package:
In [6]:
import gcp.bigquery as bq
The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we are introducing in this notebook are in a dataset called tcga_201607_beta, owned by the isb-cgc project. A full table identifier is of the form <project_id>:<dataset_id>.<table_id>. Let's start by getting some basic information about the tables in this dataset:
In [7]:
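# Open the dataset by its <project_id>:<dataset_id> identifier.
d = bq.DataSet('isb-cgc:tcga_201607_beta')
# Print the row count, size in bytes, and name of each table.
for t in d.tables():
    print('%10d rows %12d bytes %s'
          % (t.metadata.rows, t.metadata.size, t.name.table_id))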
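As a small aside, the <project_id>:<dataset_id>.<table_id> form described above can be put together and taken apart with ordinary string handling. The sketch below is only an illustration: the helper names are our own, and `some_table` is a hypothetical placeholder rather than the name of a real table in this dataset.

```python
# Illustrative helpers only; these are not part of the gcp.bigquery package.

def build_table_id(project_id, dataset_id, table_id):
    """Join the three parts into <project_id>:<dataset_id>.<table_id>."""
    return '%s:%s.%s' % (project_id, dataset_id, table_id)


def split_table_id(full_id):
    """Split a full table identifier back into its three parts.

    The split is made on the LAST ':' so that a project id that itself
    contains a colon stays in one piece.
    """
    project_id, _, rest = full_id.rpartition(':')
    dataset_id, table_id = rest.split('.', 1)
    return project_id, dataset_id, table_id


# 'some_table' is a hypothetical placeholder table name.
full_id = build_table_id('isb-cgc', 'tcga_201607_beta', 'some_table')
print(full_id)  # isb-cgc:tcga_201607_beta.some_table
print(split_table_id(full_id))
```

Replace the placeholder with one of the table names printed by the listing above to get an identifier you can actually use in a query.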
These tables are based on the open-access TCGA data as of July 2016. The molecular data is all "Level 3" data, and is divided according to platform/pipeline. See here for additional details regarding the TCGA data levels and data types.
Additional notebooks go into each of these tables in more detail, but here is an overview, in the same alphabetical order in which they are listed above and in the BigQuery web UI:
We suggest that you start with the two "Creating TCGA cohorts" notebooks (part 1 and part 2), which describe and make use of the Clinical and Biospecimen tables. From there you can delve into the various molecular data tables as well as the Annotations table. For now, these sample notebooks are intentionally simple and do not perform any analysis that integrates data from multiple tables. Once you have a grasp of how to use the data, however, developing your own more complex analyses should not be difficult. You could even contribute an example back to our GitHub repository! You are also welcome to submit bug reports, comments, and feature requests as GitHub issues.
You may be used to thinking about a molecular data table such as a gene-expression table as a matrix where the rows are genes and the columns are samples (or vice versa). These BigQuery tables instead use the tidy data approach, with each "cell" from the traditional data-matrix becoming a single row in the BigQuery table. A 10,000 gene x 500 sample matrix would therefore become a 5,000,000 row BigQuery table.
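To make the matrix-to-tidy conversion concrete, here is a minimal sketch in plain Python. The gene names, sample names, and values are made up for illustration, and the three-field row layout is an assumption chosen for this example, not the actual schema of any of the ISB-CGC tables.

```python
# Hypothetical toy matrix: rows are genes, columns are samples.
genes = ['GENE_A', 'GENE_B', 'GENE_C']
samples = ['sample_1', 'sample_2']
matrix = [
    [1.5, 2.0],   # GENE_A
    [0.0, 3.2],   # GENE_B
    [7.1, 4.4],   # GENE_C
]


def to_tidy(genes, samples, matrix):
    """Turn each matrix cell into one (gene, sample, value) row."""
    rows = []
    for i, gene in enumerate(genes):
        for j, sample in enumerate(samples):
            rows.append((gene, sample, matrix[i][j]))
    return rows


tidy_rows = to_tidy(genes, samples, matrix)
for row in tidy_rows:
    print('%-8s %-10s %5.1f' % row)

# The number of tidy rows is simply genes x samples.
n_tidy_rows = 10000 * 500
print(n_tidy_rows)  # 5000000
```

The 3 x 2 toy matrix becomes 6 rows, in the same way that the 10,000 x 500 matrix becomes 5,000,000 rows.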